class: center, middle, inverse, title-slide

# Branching Out Into Isolation Forests
## R-Ladies Dallas
### Stephanie Kirmer
www.stephaniekirmer.com
@data_stephanie
### December 7, 2020

---

# Follow Along!

https://github.com/skirmer/isolation_forests

---

# Introduction

Isolation forests are a method that uses tree-based decision-making to separate observations instead of grouping them. You might visualize this in tree form:

<img src="../IsolationForest1.png" alt="diagram1" width="600"/>

---

# Introduction

If you prefer to think about the points in two-dimensional space, you can also use something like this:

Here you can see that a highly anomalous observation is easily separated from the bulk of the sample, while a non-anomalous one requires many more steps to isolate.

---

# Getting Started

Today we are going to implement this modeling approach using a sample of data from Spotify: song characteristics.

We'll be using these libraries:

* **modeling**: isotree, fastshap
* **visuals**: ggplot2, plotly, patchwork

---

# Load Data

Our dataset: Spotify Tracks (via Kaggle)

https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv

## Track Characteristics

```
##  [1] "acousticness"     "artists"          "danceability"     "duration_ms"
##  [5] "energy"           "explicit"         "id"               "instrumentalness"
##  [9] "key"              "liveness"         "loudness"         "mode"
## [13] "name"             "popularity"       "release_date"     "speechiness"
## [17] "tempo"            "valence"          "year"
```

---

# Looking at Examples

.panelset[
.panel[.panel-name[Instrumental]

```r
knitr::kable(head(dataset[dataset$instrumentalness > .94, c("artists", "name", "year")], 5))
```

| |artists |name | year|
|:--|:------------------------------------------|:-------------------------------------------|----:|
|21 |['Moritz Moszkowski', 'Vladimir Horowitz'] |Etude in A-Flat, Op. 72, No. 11 | 1928|
|23 |['Frédéric Chopin', 'Vladimir Horowitz'] |Andante spianato in E-Flat Major, Op. 22 | 1928|
|27 |['Hafız Yaşar'] |Kız Saçların | 1928|
|42 |['Dmitry Kabalevsky', 'Vladimir Horowitz'] |Sonata No. 3, Op. 46: II. Andante cantabile | 1928|
|49 |['Shungi Music Crew'] |Rumours | 1928|

]

.panel[.panel-name[Speechy]

```r
set.seed(426)
speechy = dataset[dataset$speechiness > .9 & dataset$year > 1965, c("artists", "name", "year")]
knitr::kable(speechy[sample(nrow(speechy), 3), ])
```

| |artists |name | year|
|:-----|:----------------|:----------------------------------|----:|
|23123 |['John Mulaney'] |Blacking Out and Making Money | 2009|
|46975 |['John Mulaney'] |Law and Order and Mr. Jerry Orbach | 2009|
|76943 |['John Mulaney'] |Crime News | 2009|

]

.panel[.panel-name[Loud]

```r
set.seed(400)
loud = dataset[dataset$loudness > .85, c("artists", "name", "year")]
knitr::kable(loud[sample(nrow(loud), 3), ])
```

| |artists |name | year|
|:------|:----------------|:---------------------------------|----:|
|93132 |['The Stooges'] |Search and Destroy - Iggy Pop Mix | 1973|
|108250 |['Apocolothoth'] |Sold | 1936|
|127899 |['DYING SPASM'] |drag | 1944|

]]

---

# Feature Engineering

* Bin the years into decades
* Drop songs released before 1960

|Decade | Freq|
|:--------------|-----:|
|60s | 20000|
|70s | 20000|
|80s | 20000|
|90s | 20000|
|00s | 20000|
|10s to present | 19656|

---

# Function Syntax

We don't need to set an outcome or dependent variable, because predicting one is not the objective of this algorithm.
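
To make it concrete why no outcome variable is needed, here is a minimal base-R sketch of the isolation idea on toy data. It is an illustration only, not how isotree is implemented: `isolation_depth()` and `avg_depth()` are hypothetical helpers, and the data are simulated rather than the Spotify tracks.

```r
# Count how many random axis-aligned splits it takes before `target`
# is the only row left in its partition. (Hypothetical helper.)
isolation_depth <- function(X, target, max_depth = 50) {
  idx <- seq_len(nrow(X))   # rows still sharing a partition with the target
  depth <- 0
  while (length(idx) > 1 && depth < max_depth) {
    j <- sample(ncol(X), 1)              # choose a feature at random
    rng <- range(X[idx, j])
    if (rng[1] < rng[2]) {
      cut_point <- runif(1, rng[1], rng[2])   # choose a random split value
      # keep only the rows that fall on the target's side of the split
      if (X[target, j] < cut_point) {
        idx <- idx[X[idx, j] < cut_point]
      } else {
        idx <- idx[X[idx, j] >= cut_point]
      }
    }
    depth <- depth + 1
  }
  depth
}

set.seed(1)
X <- rbind(matrix(rnorm(400), ncol = 2),  # 200 typical points
           c(8, 8))                        # one obvious outlier (row 201)
center <- which.min(rowSums(X[1:200, ]^2)) # the most central typical point

# Average over many random "trees" (hypothetical helper)
avg_depth <- function(target, n = 200) {
  mean(replicate(n, isolation_depth(X, target)))
}

outlier_depth <- avg_depth(201)
typical_depth <- avg_depth(center)
c(outlier = outlier_depth, typical = typical_depth)
```

On this toy data the outlier should need far fewer splits on average than the central point. That difference in depth is the signal an isolation forest turns into a score, and no label is involved anywhere.
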

```r
iso_ext = isolation.forest(
  training_set[, features],
  ndim = 2,                   # each split combines two features (extended model)
  ntrees = 100,
  nthreads = 1,
  max_depth = 6,
  prob_pick_pooled_gain = 0,  # splits are chosen at random, not guided by gain
  prob_pick_avg_gain = 0,
  output_score = FALSE)       # return only the fitted model

# Score both sets with the fitted model
Z1 <- predict(iso_ext, training_set)
Z2 <- predict(iso_ext, test_set)

training_set$scores <- Z1
test_set$scores <- Z2
```

---

# Feature Importance

<!-- -->

---

# Peeking at Results

.panelset[
.panel[.panel-name[Table of Tracks]

| |artists |name | year| scores|
|:------|:-----------------------------------|:-----------------------------------------|----:|---------:|
|92291 |['Herb Alpert & The Tijuana Brass'] |Whipped Cream | 1965| 0.4596230|
|7231 |['Spawnbreezie'] |Don't Let Go | 2011| 0.4429261|
|135399 |['ILLENIUM', 'Jon Bellion'] |Good Things Fall Apart (with Jon Bellion) | 2019| 0.4415915|
|61001 |['Bonnie "Prince" Billy'] |I See A Darkness | 1998| 0.4633557|
|10548 |['Sammy Davis Jr.'] |Not for Me | 1964| 0.4405995|
|12461 |['The The'] |Soul Mining | 1983| 0.4540688|

]

.panel[.panel-name[Score Distribution]

<img src="isoforests_files/figure-html/unnamed-chunk-11-1.png" width="600" />

]

.panel[.panel-name[Scatterplot (Training)]

<img src="isoforests_files/figure-html/unnamed-chunk-12-1.png" width="650" />

]

.panel[.panel-name[Scatterplot (Test)]

<img src="isoforests_files/figure-html/unnamed-chunk-13-1.png" width="600" />

]]

---

# PCA

.panelset[
.panel[.panel-name[Component Choices]

<img src="isoforests_files/figure-html/unnamed-chunk-14-1.png" width="650" />

]

.panel[.panel-name[3D Rendering]
]]

---

# Other Exploration

.panelset[
.panel[.panel-name[Decade Proportions]

<img src="isoforests_files/figure-html/unnamed-chunk-17-1.png" width="700" />

]

.panel[.panel-name[Score Density]

<img src="isoforests_files/figure-html/unnamed-chunk-18-1.png" width="700" />

]]

---

# Further Links/Reference

* https://ggplot2.tidyverse.org/
* https://plotly.com/r/3d-scatter-plots/
* https://github.com/david-cortes/isotree

---

# Thank you!

[www.stephaniekirmer.com](http://www.stephaniekirmer.com) | @[data_stephanie](http://www.twitter.com/data_stephanie) | [saturncloud.io](http://saturncloud.io)